4. Interpreting Linear Models

  • model tuning and stability

  • interpretability

  • application to case study

Learning outcomes

  1. Describe the theoretical foundation of intrinsically interpretable models like sparse regression, gaussian processes, and classification and regression trees, and apply them to realistic case studies with appropriate validation checks.
  2. Compare the competing definitions of interpretable machine learning, the motivations behind them, and metrics that can be used to quantify whether they have been met.

Tuning

Lasso Paths

  • \(\lambda\) is a hyperparameter that controls model complexity.

  • As \(\lambda \uparrow \infty\), all coefficients shrink toward zero. As \(\lambda \downarrow 0\), we return to OLS.

  • We can study the “order” that variables enter the model as we gradually decrease \(\lambda\). We expect the most important predictors (e.g., drug response-related genes) to enter first.

Choosing \(\lambda\)

  • What is an appropriate “budget” for model complexity?

  • Bias-Variance Trade-off: Decreasing \(\lambda\) reduces bias but increases the variance of the predictions.

  • We need to tune \(\lambda\) to balance these competing issues and achieve good performance on holdout samples.

Cross-validation

Split the data into \(K\) folds (e.g., \(K = 5\)). Fit the model on \(K−1\) folds and tested on the remaining fold. This mimics the setting of gathering new data and testing the model on that data.

Hyperparameter Selection

  • \(\lambda_{\text{min}}\): Minimizes the cross-validation error.

  • \(\lambda_{\text{1se}}\): The simplest model, i.e., largest \(\lambda\), whose error is within one standard error of the minimum. This is a sparser model with comparable performance to the best.

Stability

Bootstrap

  1. Any statistical analysis depends on the particular data observed. Different data would yield different results.

  2. The bootstrap estimates this variability without collecting new data.

  3. In regularized regression, we quantify uncertainty in both predictions \(\hat{y}_i\) and coefficients \(\beta_k\).

Bootstrap Mechanism

Simulate new datasets by resampling the original data with replacement. For each bootstrap iteration \(b = 1, \dots, B\):

  1. Draw \(n\) samples with replacement from \(\{(y_i, x_i)\}_{i=1}^n\) to form

\[\begin{align*} \mathbf{y}^{b} = \begin{pmatrix}y_{1}^{b} \\ \vdots \\ y_{n}^{b}\end{pmatrix}, \quad \mathbf{X}^{b} = \begin{pmatrix}x_{1}^{b} \\ \vdots \\ x_{n}^{b}\end{pmatrix} \end{align*}\]

  1. Refit the model on \((\mathbf{y}^{b}, \mathbf{X}^{b})\) to obtain \(\hat{\beta}^{b}\) and \(\hat{y}_i^{b}\).

Bootstrap Analysis

  • Distribution of \(\{\hat{\beta}_{k}^{b}\}_{b=1}^B\) reveals stability of coefficient \(k\)
  • Distribution of \(\{\hat{y}_{i}^{b}\}_{b=1}^B\) reveals stability of prediction for observation \(i\)

Prediction Stability Plots

Interpretation

Intrinsic Interpretability

These were the properties of intrinsically interpretable models we introduced earlier.

  • Sparsity: The model uses only a few of the many candidate features.
  • Simulatability: The ability to manually calculate the outcome from the inputs.
  • Modularity: Each component can be understood independently from the others.

Parsimony

  • The \(\ell^{1}\)-penalized linear model focuses our attention on a small subset of selected features where \(\hat{\beta}_{j} = 0\).

  • The larger the \(\lambda\), the more interpretable the model becomes, since the support set \(S = \{j : \hat{\beta}_{j} \neq 0\}\) shrinks.

Simulatability

  • A linear model \(\hat{y} = \hat{\beta}_{0} + \sum_{j \in S} x_j \hat{\beta}_{j}\) is reasonably simulatable when \(|S|\) is small.

  • But for even moderate to large \(|S|\), simulatability becomes a challenge even for linear regression (Slack et al. 2019)!

Modularity

  • The model is additive: \(f(x) = \sum_{j = 1}^{J}f_j(x_j)\).

  • We can inspect the influence of gene \(j\) using a single coefficient \(\beta_{j}\).

  • But even in linear regression, ceteris paribus means each coefficient has to be understood relative to all other features in the model.

Global & local interpretations

  • Global: \(\hat{\beta}\) helps us identify the most influential genes across the study.

  • Local: For cell \(i\), the \(x_{ij}\hat{\beta}_{j}\) terms help us understand its particular viability prediction.

Case Study

Case study formulation

  • Response (\(y\)): Cell viability after drug treatment.

  • Predictors (\(X\)): High-dimensional molecular profiles \(N = 121, J = 9553\).

  • Goal: Identify a sparse set of features that distinguish drug sensitive vs. resistant cells.

Model fitting

We use the glmnet package in R:

# Fit the lasso path and perform 10-fold CV
cv_fit <- cv.glmnet(X, y, alpha = 1, standardize = TRUE)

This solves the optimization problem for a sequence of \(\lambda\) values using cyclical coordinate descent.

Evaluation criteria

These boxplots show the performance on the holdout folds across a range of \(\lambda\) complexity parameters. For each \(k = 1, \dots, 10\), we compute

\[\begin{align*} CV_{k}\left(\lambda\right) = \frac{1}{N} \sum_{i \in I_{k}} \left(y_{i} - \hat{y}_{i}^{-k}\left(\lambda\right)\right)^2 \end{align*}\] where \[\begin{align*} y_{i}^{-k}\left(\lambda\right) := x_{i}^\top \hat{\beta}^{-k}\left(\lambda\right) \end{align*}\] and \(\hat{\beta}^{-k}\left(\lambda\right)\) solves the lasso optimization at hyperparameter \(\lambda\) using data from all folds except \(I_{k}\).

Evaluation criteria

  • The initial decrease is the phase where adding predictors rapidly improves performance. Usually the boxplots increase again for small values of \(\lambda\), at which point the model is overfitting.

  • We use \(\lambda_{1\text{se}}\) to choose the simplest model whose error is within one standard error of the minimum MSE. This is the point where adding new genes doesn’t significantly add to the holdout prediction accuracy.

Discussion

Respond to [Evaluation Critique] in the exercise sheet.

Case Study

Data preparation

  • It’s important to transform \(x_{ij} \to \frac{x_{ij} - \bar{x}_j}{\hat{\sigma}_j}\).

  • The penalty \(\lambda \sum |\beta_j|\) treats all \(\beta_j\) equally. If \(x_j\) has a large scale, its \(\beta_j\) will be small, making it “cheaper” for the penalty to keep it in the model.

Coefficient Paths

  • We can visualize how the ridge coefficients \(\hat{\beta}_j\) evolves as \(\lambda\) decreases.

  • Each line represents a gene (RNA or methylation feature). The order in which they “emerge” from zero indicates their relative importance in predicting drug response.

Coefficient Paths

These are the equivalent results for the lasso. Notice that coefficients stay at exactly 0 for many \(\lambda\).

Lasso vs. Ridge

Takeaways

  • Sparse linear models narrowed us down from 9K+ features to 37 that are enough to predict drug sensitivity with relatively high accuracy.

  • In this problem, standardization was essential. We are lucky that it is the default in glmnet, but this isn’t always the case.

Possible Improvements

  • The original scale of the features might be informative. By standardizing features, we might be giving undue weight to noise features with low variability.

  • Are there distinct gene pathway signatures for different drugs? In this case, we should consider either drug-pathway interaction terms or drug-specific classifiers.

Instability

  • The Problem: When predictors are highly correlated, the \(\hat{\beta}\) estimated by lasso becomes unstable.

  • Small changes in the training data can cause the lasso to switch between two highly correlated genes, leading to conflicting interpretations of the same underlying signal.

Simulation setup

We simulated data where \(Cor\left(x_{2}, x_{3}\right) > 0.95\). The true coefficients are \(\beta_{1} = \beta_{2} = 2\), \(\beta_{5} = 1.5\), and all other \(\beta_{j} = 0\).

Code: https://go.wisc.edu/m11v16

Simulation results

These are the feature selection frequencies across 1000 bootstrap iterations. Genes 1 and 2 compete for selection.

Takeaway

  • Selection \(\neq\) Importance

    • Just because a gene was not selected by the lasso does not mean it is not biologically relevant – it could be correlated with a gene that’s part of the true biological mechanism.
  • More generally, we should be careful applying sparse models to correlated data. There may be many plausible, equally predictive explanations based on different subsets of features.

Interpretability vs. accuracy trade-offs

  • Often, people assume that to get higher accuracy, we need “black-box” models (like deep learning or Random Forests) and sacrifice interpretability.

  • But in many scientific problems (like today’s case study, see also (Benchmarks - Open Problems in Single-Cell Analysis — Openproblems.bio”)), sparse linear models are surprisingly competitive, while remaining more interpretable to more audiences. It depends on the true relationship between predictors and response in the data.

Exercise

Respond to [Code Analysis] in the exercise sheet.

Benchmarks - Open Problems in Single-Cell Analysis — Openproblems.bio.” https://openproblems.bio/benchmarks.
Slack, Dylan, Sorelle A. Friedler, Carlos Scheidegger, and Chitradeep Dutta Roy. 2019. “Assessing the Local Interpretability of Machine Learning Models.” arXiv. https://doi.org/10.48550/ARXIV.1902.03501.